Managing multiple data-frames

Presented to EdinbR R-Users Group, 2018-07-18

Russ Hyde, University of Glasgow

Background and Links:

Preamble

See https://github.com/russHyde/polyply

# Dependencies:
# - purrr, methods, rlang, tidygraph, dplyr
# (install these yourself: `dependencies = FALSE` skips them)
if (!requireNamespace("polyply", quietly = TRUE)) {
  # assumes devtools is already installed
  devtools::install_github(
    repo = "russHyde/polyply", dependencies = FALSE
  )
}

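# The later slides also assume these packages are attached (an assumption;
# the original setup chunk is not shown). polyply is attached last.
suppressPackageStartupMessages({
  library(dplyr); library(ggplot2); library(stringr); library(Biobase)
  library(polyply)
})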

Data-Modelling

Tidy Data and the Normal-Forms

In tidy data¹:

  • TD1 - Each variable forms a column.
  • TD2 - Each observation forms a row.
  • TD3 - Each type of observational unit forms a table.
  • [TD4 - A key permitting table-joins is present]

See also database normal forms² (e.g. Boyce-Codd normal form) and relational-database design.

  • ?? TD5 - A tidy way of encapsulating your nicely decomposed tables
  • ?? TD6 - An explicit workflow for combining your tables back together

1: http://vita.had.co.nz/papers/tidy-data.html
2: https://en.wikipedia.org/wiki/First_normal_form
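As a minimal sketch of TD3 and TD4, the base-R snippet below splits one table into two (one per observational unit) and then rejoins them on a shared key. The data and the names `wide`, `samples` and `measurements` are made up for illustration.

```r
# Hypothetical toy data: one row per measurement, with the sample-level
# information (tissue) repeated on every row.
wide <- data.frame(
  sample_id = c("s1", "s1", "s2", "s2"),
  tissue    = c("blood", "blood", "marrow", "marrow"),
  gene      = c("g1", "g2", "g1", "g2"),
  value     = c(1.5, 2.0, 0.3, 4.1),
  stringsAsFactors = FALSE
)

# TD3: one table per type of observational unit.
samples      <- unique(wide[, c("sample_id", "tissue")])
measurements <- wide[, c("sample_id", "gene", "value")]

# TD4: the shared key `sample_id` lets us join the tables back together.
rejoined <- merge(measurements, samples, by = "sample_id")
```

After the split, `tissue` is stored once per sample rather than once per measurement; the join recovers the original columns when they are needed.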

Common Untidy Data Structures

Tidy-data / normal-forms in R

  • ↓ duplication

  • play nicely with some important things (ggplot2 etc)

But untidy data-structures are useful if they:

  • ↑ access efficiency

  • ↓ code complexity

  • play nicely with other important things

Biobase::ExpressionSet

Biobase::ExpressionSet()
[Figure: structure of an ExpressionSet. The ExpressionSet contains assayData (matrix; nrow = |genes|, ncol = |samples|), protocolData, experimentData and Annotation. assayData links to featureData (data-frame; nrow = |genes|) and phenoData (data-frame; nrow = |samples|).]

Figure made with DiagrammeR

Biobase::ExpressionSet (cont.)

Conversion of the assayData to meet tidy-data standards:

m # our assayData
##       sample1 sample2 sample3
## gene1    12.2   111.0   129.0
## gene2    19.1    10.5   123.0
## gene3     0.5     3.4     1.1

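Doesn’t meet tidy-data standards:

  • rows correspond to features and columns to samples, so the sample-IDs are spread across the column headers
  • not all variables are in columns (the feature-IDs live in the row names)
  • every entry is the same ‘type’ of variable, so the values belong in a single column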

Easy fix³:

m2 <- reshape2::melt(
  m,
  varnames = c("feature_id", "sample_id"),
  as.is = TRUE
)

head(m2, 4)
##   feature_id sample_id value
##        <chr>     <chr> <dbl>
## 1      gene1   sample1  12.2
## 2      gene2   sample1  19.1
## 3      gene3   sample1   0.5
## 4      gene1   sample2 111.0

3: ... or as.data.frame / rownames_to_column / gather
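For comparison, here is a rough base-R sketch of the same reshaping. It is a stand-in for `reshape2::melt`, not the route named in the footnote; the matrix `m` is rebuilt from the values printed on the slide, and `m_long` is a name introduced here.

```r
# Rebuild the example assayData matrix (values taken from the slide).
m <- matrix(
  c(12.2, 19.1, 0.5, 111.0, 10.5, 3.4, 129.0, 123.0, 1.1),
  nrow = 3,
  dimnames = list(
    c("gene1", "gene2", "gene3"),
    c("sample1", "sample2", "sample3")
  )
)

# row(m) / col(m) give the row / column index of every cell, in the same
# column-major order that as.vector(m) uses for the values.
m_long <- data.frame(
  feature_id = rownames(m)[row(m)],
  sample_id  = colnames(m)[col(m)],
  value      = as.vector(m),
  stringsAsFactors = FALSE
)

head(m_long, 4)
```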

But …

  • Matrix representation was more dense

  • Lost all encapsulation

  • (After modifying featureData / phenoData to match)
    • Have to join rather than index
    • Have to keep track of multiple data-frames, rather than one data-structure
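The "join rather than index" trade-off can be sketched with a toy example in base R. All data and names below (`m`, `pheno`, `m_long`, the "case" / "control" groups) are hypothetical.

```r
# Hypothetical toy data: a 2 x 3 expression matrix and a sample annotation.
m <- matrix(
  1:6, nrow = 2,
  dimnames = list(c("gene1", "gene2"), c("s1", "s2", "s3"))
)
pheno <- data.frame(
  sample_id = c("s1", "s2", "s3"),
  group     = c("case", "control", "case"),
  stringsAsFactors = FALSE
)

# Matrix form: pick out the 'case' samples for gene1 by indexing.
case_ids <- pheno$sample_id[pheno$group == "case"]
by_index <- m["gene1", case_ids]

# Long (tidy) form: the same question needs a join, then a filter.
m_long <- data.frame(
  feature_id = rownames(m)[row(m)],
  sample_id  = colnames(m)[col(m)],
  value      = as.vector(m),
  stringsAsFactors = FALSE
)
joined  <- merge(m_long, pheno, by = "sample_id")
by_join <- joined$value[joined$feature_id == "gene1" & joined$group == "case"]
```

Both routes return the same values; the matrix stores six numbers, whereas the long form stores six rows of three columns.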

That multi-data-frame thing

For a reasonably complex project:

  • tidy-data / normal-forms mean more data-frames

Wanted:

  • a lightweight approach to working with multiple ‘conceptually-related’ data-frames

  • that plays nicely with tidyverse verbs

  • that feeds into ggplot2

  • that plays nicely with untidy data-structures I use all the time

tidygraph already (sort of) does this

Graph theory

Basics of ‘graph theory’ speak

A graph is made up of two sets:

  • V, a set of vertices:
    • aka nodes, actors, …
  • E, a set of edges:
    • pairwise relationships between vertices
    • aka interactions, lines, arcs, …

Both nodes and edges usually carry attributes, so we need somewhere to store them.
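A minimal base-R sketch of this idea: one data-frame of nodes, one of edges, with the edges referring to nodes by row number. The graph and the column names are made up for illustration.

```r
# Hypothetical toy graph: four nodes, four edges.
nodes <- data.frame(name = c("A", "B", "C", "D"), stringsAsFactors = FALSE)
edges <- data.frame(
  from  = c(1L, 1L, 1L, 2L),   # row index into `nodes`
  to    = c(2L, 3L, 4L, 3L),   # row index into `nodes`
  label = c("a", "b", "c", "d"),
  stringsAsFactors = FALSE
)

# Edge attributes can be looked up from the node table ...
edges$from_name <- nodes$name[edges$from]
edges$to_name   <- nodes$name[edges$to]

# ... and node attributes can be derived from the edge table:
# the degree of a node is the number of edge endpoints that touch it.
nodes$degree <- tabulate(c(edges$from, edges$to), nbins = nrow(nodes))
```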

tbl_graph data structure

tidygraph is really a wrapper around the package igraph

data("Koenigsberg", package = "igraphdata")
tg <- tidygraph::as_tbl_graph(Koenigsberg)

# Nodes data shows up first:
tg
## # A tbl_graph: 4 nodes and 7 edges
## #
## # An undirected multigraph with 1 component
## #
## # Node Data: 4 x 2 (active)
##   name                Euler_letter
##   <chr>               <chr>       
## 1 Altstadt-Loebenicht B           
## 2 Kneiphof            A           
## 3 Vorstadt-Haberberg  C           
## 4 Lomse               D           
## #
## # Edge Data: 7 x 4
##    from    to Euler_letter name           
##   <int> <int> <chr>        <chr>          
## 1     1     2 a            Kraemer Bruecke
## 2     1     2 b            Schmiedebruecke
## 3     1     4 f            Holzbruecke    
## # ... with 4 more rows

tbl_graph data structure (cont.)

# If we make the 'edges' active, the edge-data shows up first:
activate(tg, edges)
## # A tbl_graph: 4 nodes and 7 edges
## #
## # An undirected multigraph with 1 component
## #
## # Edge Data: 7 x 4 (active)
##    from    to Euler_letter name           
##   <int> <int> <chr>        <chr>          
## 1     1     2 a            Kraemer Bruecke
## 2     1     2 b            Schmiedebruecke
## 3     1     4 f            Holzbruecke    
## 4     2     4 e            Honigbruecke   
## 5     3     4 g            Hohe Bruecke   
## 6     2     3 c            Gruene Bruecke 
## # ... with 1 more row
## #
## # Node Data: 4 x 2
##   name                Euler_letter
##   <chr>               <chr>       
## 1 Altstadt-Loebenicht B           
## 2 Kneiphof            A           
## 3 Vorstadt-Haberberg  C           
## # ... with 1 more row

The activate verb

Think of the tbl_graph as list[nodes, edges]

To modify the contents of a given data-frame, activate it:

tg %>%
  activate(edges) %>%
  mutate(weight = nchar(name))
## # A tbl_graph: 4 nodes and 7 edges
## #
## # An undirected multigraph with 1 component
## #
## # Edge Data: 7 x 5 (active)
##    from    to Euler_letter name            weight
##   <int> <int> <chr>        <chr>            <int>
## 1     1     2 a            Kraemer Bruecke     15
## 2     1     2 b            Schmiedebruecke     15
## 3     1     4 f            Holzbruecke         11
## 4     2     4 e            Honigbruecke        12
## 5     3     4 g            Hohe Bruecke        12
## 6     2     3 c            Gruene Bruecke      14
## # ... with 1 more row
## #
## # Node Data: 4 x 2
##   name                Euler_letter
##   <chr>               <chr>       
## 1 Altstadt-Loebenicht B           
## 2 Kneiphof            A           
## 3 Vorstadt-Haberberg  C           
## # ... with 1 more row
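To make the "list[nodes, edges] plus an active element" picture concrete, here is a toy base-R sketch. It is not how tidygraph is implemented; `toy_activate` and `toy_modify` are hypothetical helpers invented for this illustration.

```r
# Hypothetical toy structure: a list of two data-frames plus an
# attribute recording which of them is 'active'.
toy <- list(
  nodes = data.frame(name = c("A", "B"), stringsAsFactors = FALSE),
  edges = data.frame(
    from = 1L, to = 2L, name = "bridge", stringsAsFactors = FALSE
  )
)
attr(toy, "active") <- "nodes"

# Switch the active data-frame.
toy_activate <- function(x, what) {
  stopifnot(what %in% names(x))
  attr(x, "active") <- what
  x
}

# Apply a function to the active data-frame only.
toy_modify <- function(x, f) {
  active <- attr(x, "active")
  x[[active]] <- f(x[[active]])
  x
}

# Analogue of `activate(edges) %>% mutate(weight = nchar(name))`.
toy2 <- toy_modify(
  toy_activate(toy, "edges"),
  function(df) transform(df, weight = nchar(name))
)
```

Only the edge table gains a `weight` column; the node table is carried along untouched.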

polyply and multiple, linked data-frames

polyply

Aim:

  • multiple data-frames in one data-structure

    • → class poly_frame: extends list
    • poly_frame: [list[data-frame], merge_fn]
  • mutation / filtering

  • merging

Exported functions

  • as_poly_frame

    • convert a data-structure into a poly_frame
  • activate

    • choose a data-frame from within the poly_frame
  • filter

    • modify the contents of the active data-frame
  • merge

    • user-defined data-frame combiner (default: fold an inner-join over the list of data-frames, i.e. reduce(inner_join)(df_list))
  • others to be added (mutate / select etc)
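The "[list[data-frame], merge_fn]" idea can be sketched in base R as follows. This is a toy under stated assumptions, not polyply's internals: `toy_pf` and `default_merge` are hypothetical names, and base `merge()` stands in for `inner_join`.

```r
# Fold an inner join over a list of data-frames. With its default
# arguments, base merge() joins on the shared column names and keeps
# only the matching rows.
default_merge <- function(df_list) {
  Reduce(function(x, y) merge(x, y), df_list)
}

# Hypothetical toy poly-frame: the data-frames plus their combiner.
toy_pf <- list(
  frames = list(
    exprs = data.frame(
      sample_id = c("s1", "s2", "s3"), value = c(1.0, 2.0, 3.0),
      stringsAsFactors = FALSE
    ),
    pheno = data.frame(
      sample_id = c("s1", "s2"), type = c("AML", "CML"),
      stringsAsFactors = FALSE
    )
  ),
  merge_fn = default_merge
)

merged <- toy_pf$merge_fn(toy_pf$frames)

# Filtering one data-frame propagates through the inner join.
toy_pf$frames$pheno <- subset(toy_pf$frames$pheno, type == "AML")
merged_aml <- toy_pf$merge_fn(toy_pf$frames)
```

Sample `s3` has no annotation, so it drops out of `merged`; after the filter, only the AML sample survives the merge.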

Examples

ExpressionSet Example

data("leukemiasEset", package = "leukemiasEset")
leuk <- leukemiasEset
leuk
## ExpressionSet (storageMode: lockedEnvironment)
## assayData: 20172 features, 60 samples 
##   element names: exprs, se.exprs 
## protocolData
##   sampleNames: GSM330151.CEL GSM330153.CEL ... GSM331677.CEL (60
##     total)
##   varLabels: ScanDate
##   varMetadata: labelDescription
## phenoData
##   sampleNames: GSM330151.CEL GSM330153.CEL ... GSM331677.CEL (60
##     total)
##   varLabels: Project Tissue ... Subtype (5 total)
##   varMetadata: labelDescription
## featureData: none
## experimentData: use 'experimentData(object)'
## Annotation: genemapperhgu133plus2

Construct a poly-frame from an ExpressionSet

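leuk_pf <- list(
  # Long-format expression values: one row per (feature, sample) pair
  exprs = reshape2::melt(
    Biobase::exprs(leuk),
    as.is = TRUE,
    varnames = c("feature_id", "sample_id")
  ),
  # Sample annotations, with the row names promoted to a `sample_id` key
  pheno = tibble::rownames_to_column(
    Biobase::pData(leuk),
    var = "sample_id"
  )
) %>%
  as_poly_frame()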

What did we just make?

purrr::map(leuk_pf, head)
## $exprs
##        feature_id     sample_id    value
## 1 ENSG00000000003 GSM330151.CEL 3.386743
## 2 ENSG00000000005 GSM330151.CEL 3.539030
## 3 ENSG00000000419 GSM330151.CEL 9.822758
## 4 ENSG00000000457 GSM330151.CEL 4.747283
## 5 ENSG00000000460 GSM330151.CEL 3.307188
## 6 ENSG00000000938 GSM330151.CEL 8.230721
## 
## $pheno
##       sample_id Project     Tissue LeukemiaType
## 1 GSM330151.CEL   Mile1 BoneMarrow          ALL
## 2 GSM330153.CEL   Mile1 BoneMarrow          ALL
## 3 GSM330154.CEL   Mile1 BoneMarrow          ALL
## 4 GSM330157.CEL   Mile1 BoneMarrow          ALL
## 5 GSM330171.CEL   Mile1 BoneMarrow          ALL
## 6 GSM330174.CEL   Mile1 BoneMarrow          ALL
##           LeukemiaTypeFullName                         Subtype
## 1 Acute Lymphoblastic Leukemia c_ALL/Pre_B_ALL without t(9 22)
## 2 Acute Lymphoblastic Leukemia c_ALL/Pre_B_ALL without t(9 22)
## 3 Acute Lymphoblastic Leukemia c_ALL/Pre_B_ALL without t(9 22)
## 4 Acute Lymphoblastic Leukemia c_ALL/Pre_B_ALL without t(9 22)
## 5 Acute Lymphoblastic Leukemia c_ALL/Pre_B_ALL without t(9 22)
## 6 Acute Lymphoblastic Leukemia c_ALL/Pre_B_ALL without t(9 22)

Filter and plot:

my_plot <- leuk_pf %>%
  # At first, data-frame `exprs` is active
  filter(feature_id %in% c("ENSG00000000003", "ENSG00000000005")) %>%
  # Select a different data-frame for filtering:
  # - you can use non-standard-evaluation in `activate`
  activate(pheno) %>%
  # only look at myeloid leukaemias
  filter(LeukemiaType %in% c("AML", "CML")) %>%
  # default merge: fold an inner-join
  merge() %>%
  ggplot()

Filter and plot (cont.)

my_plot +
  geom_boxplot(aes(x = LeukemiaType, y = value)) +
  facet_wrap(~ feature_id) +
  ggtitle("These might not be the most interesting genes in the dataset ...")

Taxonomy and brains

data(Animals, package = "MASS")
animals <- Animals %>%
  tibble::rownames_to_column(var = "common_name") %>%
  mutate(
    # fix a misspelling in the original dataset
    common_name = stringr::str_replace(
      common_name, "Dipliodocus", "Diplodocus"
    )
  )
# Map common names to species names; keep strings as characters so that the
# join keys match the character columns in `animals` and `taxonomy`
common_to_species <- data.frame(
  common_name = c("Mountain beaver", "Cow", "Grey wolf", "Goat", "Guinea pig",
    "Diplodocus", "Asian elephant", "Donkey", "Horse", "Potar monkey", "Cat"
  ),
  species = c("Aplodontia rufa", "Bos taurus", "Canis lupus",
    "Capra hircus",
    "Cavia porcellus", "Diplodocus longus",
    "Elephas maximus", "Equus africanus asinus",
    "Equus ferus caballus", NA, "Felis silvestris"
  ),
  stringsAsFactors = FALSE
)

Taxonomies (cont.)

taxon_data <- taxize::classification(
  x = common_to_species$species,
  get = "order",
  db = "ncbi"
)
  
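# Drop failed look-ups (non-data-frame results), stack the per-species tables,
# and keep one `order` column per species
taxonomy <- Filter(is.data.frame, taxon_data) %>%
  bind_rows(.id = "species") %>%
  select(-id) %>%
  filter(rank == "order") %>%
  tidyr::spread(key = rank, value = name)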

head(taxonomy)
##   species                order
##   <chr>                  <chr>
## 1 Aplodontia rufa        Rodentia
## 2 Canis lupus            Carnivora
## 3 Cavia porcellus        Rodentia
## 4 Elephas maximus        Proboscidea
## 5 Equus africanus asinus Perissodactyla
## 6 Felis silvestris       Carnivora

Taxonomies & brains (cont.)

as_poly_frame(
  list(animals, common_to_species, taxonomy)
) %>%
  merge() %>%
  ggplot(aes(x = body, y = brain, col = order)) +
  geom_point() +
  xlim(0, NA) + ylim(0, NA)
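What the merge over three data-frames amounts to can be sketched in base R with made-up values. The fold joins left to right, so each table must share a key with the result so far: here `common_name` first, then `species`. The tables below are hypothetical miniatures, not the real data, and base `merge()` stands in for the inner join.

```r
# Hypothetical miniature versions of the three tables (made-up values).
animals_toy <- data.frame(
  common_name = c("Cow", "Cat", "Potar monkey"),
  body  = c(400, 3, 10),
  brain = c(420, 25, 115),
  stringsAsFactors = FALSE
)
common_to_species_toy <- data.frame(
  common_name = c("Cow", "Cat", "Potar monkey"),
  species = c("Bos taurus", "Felis silvestris", NA),
  stringsAsFactors = FALSE
)
taxonomy_toy <- data.frame(
  species = c("Bos taurus", "Felis silvestris"),
  order   = c("Artiodactyla", "Carnivora"),
  stringsAsFactors = FALSE
)

# Fold an inner join over the list, left to right.
merged_toy <- Reduce(
  merge, list(animals_toy, common_to_species_toy, taxonomy_toy)
)
```

The animal with no species name has no match in the taxonomy table, so the inner join silently drops it from the plotted data.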

Thanks